178 research outputs found

Over-Fitting in Model Selection with Gaussian Process Regression

Author: A Girard
CE Rasmussen
DJC MacKay
G Walter
GC Cawley
GC Cawley
J Bernardo
J Demšar
JQ Shi
M Abramowitz
T Dietterich
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2017

Model selection in Gaussian Process Regression (GPR) seeks to determine the optimal values of the hyper-parameters governing the covariance function, which allows flexible customization of the GP to the problem at hand. An often-overlooked issue encountered in the model selection process is over-fitting the model selection criterion, typically the marginal likelihood. Over-fitting here refers to fitting the random noise present in the model selection criterion in addition to the features that improve the generalisation performance of the statistical model. In this paper, we construct several Gaussian process regression models for a range of high-dimensional datasets from the UCI machine learning repository. We then compare both the MSE on the test dataset and the negative log marginal likelihood (nlZ), used as the model selection criterion, to find whether the problem of over-fitting in model selection also affects GPR. We found that the squared exponential covariance function with Automatic Relevance Determination (SEard) is better than other kernels, including the squared exponential covariance function with isotropic distance measure (SEiso), according to the nlZ, but it is clearly not the best according to MSE on the test data, which indicates an over-fitting problem in model selection.

Crossref

University of East Anglia digital repository

Wind Data Mining by Kohonen Neural Networks

Author: A Ultsch
AF Jenkinson
Ananth Grama
C Marzban
Carolina Fayos
D Chen
GC Cawley
HH Lamb
HH Lamb
José Fayos
KR Briffa
M Hulme
PD Jones
R Kretzschmar
T Kohonen
TD Davies
Publication venue: Public Library of Science
Publication date: 01/01/2007

Time series of Circulation Weather Type (CWT), including daily averaged wind direction and vorticity, are self-classified by similarity using Kohonen Neural Networks (KNN). It is shown that KNN is able to map by similarity all 7300 five-day CWT sequences during the period 1975–94 in London, United Kingdom. As a first result, it gives the most probable wind sequences preceding each one of the 27 CWT Lamb classes in that period. Conversely, as a second result, the observed diffuse correlation between the five-day CWT sequences and the CWT of the 6th day, over the long 20-year period, can be generalized to predict the latter from the preceding CWT sequence in a different test period, such as 1995, as both time series are similar. Although the average prediction error is comparable to that obtained by standard forecasting methods, the KNN approach gives complementary results, as they depend only on an objective classification of observed CWT data, without any model assumption. The 27 CWT of the Lamb Catalogue were coded with binary three-dimensional vectors, pointing to the faces, edges and vertices of a “wind-cube,” so that similar CWT vectors were close.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Digital.CSIC

Integrating phenotype and gene expression data for predicting gene function

Author: A Kahraman
Affymetrix
Andy D Perkins
Brandon M Malone
DW Huang
GC Cawley
J Rodgers
JD Wren
L In-Yee
M Ashburner
M Ashburner
M Steinbach
N Daraselia
NCBI
P Groth
P Groth
Rosetta Biosoftware
Susan M Bridges
U Karaoz
Y Zhao
Publication venue: BioMed Central
Publication date: 01/01/2009

Crossref

Springer - Publisher Connector

PubMed Central

Injecting Domain Knowledge in Electronic Medical Records to Improve Hospitalization Prediction

Author: AG Salguero
BA Goldstein
CC Chang
F Pedregosa
FJ Ordónez
G Forman
GC Cawley
H Min
J Bergstra
L Breiman
P McCullagh
V Lacroix-Hugues
Publication venue: Springer Science and Business Media LLC
Publication date: 02/06/2019

Electronic medical records (EMRs) contain key information about the different symptomatic episodes that a patient went through. They carry great potential for improving the well-being of patients and therefore represent a very valuable input for artificial intelligence approaches. However, the explicit knowledge directly available through these records remains limited: the extracted features used by machine learning algorithms do not contain all the implicit knowledge of medical experts. In order to evaluate the impact of domain knowledge when processing EMRs, we augment the features extracted from EMRs with ontological resources before turning them into vectors used by machine learning algorithms. We evaluate these augmentations with several machine learning algorithms to predict hospitalization. Our approach was evaluated on data from the PRIMEGE PACA database, which contains more than 350,000 consultations carried out by 16 general practitioners (GPs).

Crossref

INRIA a CCSD electronic archive server

A new framework for sign language alphabet hand posture recognition using geometrical features through artificial neural network (part 1)

Author: A Anand
CC Chang
CW Hsu
D Janez
G Aquino
G Fang
GC Cawley
H Han
H Liang
H-S Yeo
I Elias
IV Tetko
J De Jesús Rubio
K Assaleh
K Crammer
K-J Yoon
N Bahman
N Duta
P Garg
P Kauff
P Kishore
PVV Kishore
Q-S Zhu
RKE Yo
RR Sanchez
S Shamshirband
SS Rautaray
T Fawcett
TG Dietterich
W Xiong
X Wu
Z Luca
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2020

Hand pose tracking is essential in sign languages. Automatic recognition of performed hand signs facilitates a number of applications, especially helping people with speech impairment to communicate with others. This framework, called ASLNN, proposes a new hand posture recognition technique for the American sign language alphabet, based on a neural network that works on geometrical features extracted from the hands. A user’s hand is captured by a three-dimensional depth-based sensor camera; the hand is then segmented according to depth analysis features. The proposed system is called depth-based geometrical sign language recognition (DGSLR). DGSLR adopts an easier hand segmentation approach, which can also be used in other segmentation applications. The proposed geometrical feature extraction framework improves recognition accuracy because its features are invariant to hand orientation, in contrast to discrete cosine transform and moment invariant features. The findings of the iterations demonstrate that combining the extracted features resulted in improved accuracy rates. An artificial neural network is then used to produce the desired outcomes. ASLNN is proficient at hand posture recognition and provides accuracy of up to 96.78%, which will be discussed in an additional paper by these authors in this journal.

LJMU Research Online (Liverpool John Moores University)

Crossref

Universiti Teknologi Malaysia Institutional Repository

Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm

Author: A Agrawal
A Jain
A Jain
A Kabiraj
A Mansouri Tehrani
AA Emery
B Meredig
C Bernau
C Chen
CB Cooper
CL Clement
D Krstajic
F Faber
F Pedregosa
F Ren
G Petretto
GC Cawley
H Hotelling
HS Stein
I Petousis
IE Castelli
J Schmidt
JJ Heckman
JP Perdew
K Choudhary
K Choudhary
L Breiman
L Ward
L Ward
M de Jong
M Stone
NN Kiselyova
P Hohenberg
R Jose
SP Ong
T Xie
W Kohn
Y Liu
Y Zhang
Y Zhuo
Z Wu
Z Xiong
Publication venue: Springer Science and Business Media LLC
Publication date: 07/05/2020

We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13 ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm, namely that crystal graph methods appear to outperform traditional machine learning methods given ~10^4 or greater data points. The pre-processed, ready-to-use Matbench tasks and the Automatminer source code are open source and available online (http://hackingmaterials.lbl.gov/automatminer/). We encourage evaluating new materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Occupancy Classification of Position Weight Matrix-Inferred Transcription Factor Binding Sites

Author: A Barski
A Valouev
Aaron Cohen
B Lenhard
CC Chang
D Karolchik
DH Wolpert
DL Daniels
E Roulet
FN Jensen
G Cooper
G Pavesi
G Robertson
GC Prendergast
GD Stormo
Gregory Yochum
Hollis Wright
IH Witten
Indra Neil Sarkar
J Cohen
JE Darnell
Kemal Sönmez
KI Zeller
KJ Won
M Tompa
MA Hall
N Friedman
ND Heintzman
OJ Sansom
P Hatzis
PJ Collins
Q Sun
R Staden
S Cawley
S Sinha
Shannon McWeeney
SL Schreiber
TL Bailey
TY Roh
UM Fayyad
V Matys
VN Vapnik
Y Chen
YJ Shann
Publication venue: Public Library of Science
Publication date: 04/11/2011

BACKGROUND: Computational prediction of Transcription Factor Binding Sites (TFBS) from sequence data alone is difficult and error-prone. Machine learning techniques that use additional environmental information about a predicted binding site (such as distances from the site to particular chromatin features) to determine its occupancy/functionality class show promise as methods to achieve more accurate prediction of true TFBS in silico. We evaluate the Bayesian Network (BN) and Support Vector Machine (SVM) machine learning techniques on four distinct TFBS data sets and analyze their performance. We describe the features that are most useful for classification and compare these feature sets between the factors. RESULTS: Our results demonstrate good performance of classifiers both on TFBS for transcription factors used for initial training and on TFBS for other factors in cross-classification experiments. We find distances to chromatin modifications (specifically, histone modification islands), as well as distances between such modifications, to be effective predictors of TFBS occupancy, though the impact of individual predictors is largely TF specific. In our experiments, Bayesian network classifiers outperform SVM classifiers. CONCLUSIONS: Our results demonstrate good performance of machine learning techniques on the problem of occupancy classification, and demonstrate that effective classification can be achieved using distances to chromatin features. We additionally demonstrate that cross-classification of TFBS is possible, suggesting the possibility of constructing a generalizable occupancy classifier capable of handling TFBS for many different transcription factors.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

L2-norm multiple kernel learning and its application to biomedical data fusion

Author: A Daemen
A Daemen
Anneleen Daemen
AY Ng
B Schölkopf
Bart De Moor
C Bottomley
C Leslie
DMJ Tax
ED Andersen
FR Bach
G Condous
G Thomas
GC Cawley
GRG Lanckriet
GRG Lanckriet
J Gudmundsson
J Shawe-Taylor
JAK Suykens
JAK Suykens
Johan AK Suykens
JP Ye
K Tretyakov
K Veropoulos
Leon-Charles Tranchevent
M Grant
M Grant
M Kloft
M Kloft
M Kowalski
O Gevaert
R Hettich
R Reemtsen
RA Eeles
S Aerts
S Sonnenburg
S Yu
Shi Yu
SJ Kim
T De Bie
T van den Bosch
Tillmann Falck
V Vapnik
Y Zheng
Yves Moreau
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: This paper introduces the notion of optimizing different norms in the dual problem of support vector machines with multiple kernels. The selection of norms yields different extensions of multiple kernel learning (MKL) such as L∞, L1, and L2 MKL. In particular, L2 MKL is a novel method that leads to non-sparse optimal kernel coefficients, which is different from the sparse kernel coefficients optimized by the existing L∞ MKL method. In real biomedical applications, L2 MKL may have more advantages over sparse integration methods for thoroughly combining complementary information in heterogeneous data sources. Results: We provide a theoretical analysis of the relationship between the L2 optimization of kernels in the dual problem and the L2 coefficient regularization in the primal problem. Understanding the dual L2 problem grants a unified view on MKL and enables us to extend the L2 method to a wide range of machine learning problems. We implement L2 MKL for ranking and classification problems and compare its performance with the sparse L∞ and the averaging L1 MKL methods. The experiments are carried out on six real biomedical data sets and two large scale UCI data sets. L2 MKL yields better performance on most of the benchmark data sets. In particular, we propose a novel L2 MKL least squares support vector machine (LSSVM) algorithm, which is shown to be an efficient and promising classifier for processing large scale data sets. Conclusions: This paper extends the statistical framework of genomic data fusion based on MKL. Allowing non-sparse weights on the data sources is an attractive option in settings where we believe most data sources to be relevant to the problem at hand and want to avoid a "winner-takes-all" effect seen in L∞ MKL, which can be detrimental to the performance in prospective studies. The notion of optimizing L2 kernels can be straightforwardly extended to ranking, classification, regression, and clustering algorithms. To tackle the computational burden of MKL, this paper proposes several novel LSSVM based MKL algorithms. Systematic comparison on real data sets shows that LSSVM MKL has performance comparable to that of the conventional SVM MKL algorithms. Moreover, large scale numerical experiments indicate that when cast as semi-infinite programming, LSSVM MKL can be solved more efficiently than SVM MKL. Availability: The MATLAB code of the algorithms implemented in this paper is downloadable from http://homes.esat.kuleuven.be/~sistawww/bioi/syu/l2lssvm.html.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Candidate gene prioritization by network analysis of differential expression using machine learning approaches

Author: A Subramanian
A Zanzoni
AJ Smola
AP Francisco
B Aranda
B Harr
Bart de Moor
C Saunders
C Stark
C von Mering
D Nitsch
D Zieker
Daniela Nitsch
F Chung
F Fouss
Fabian Ojeda
GC Cawley
GD Bader
H Yang
HY Chuang
J Chen
JA Hanley
Joana P Gonçalves
JW Park
K Lage
KR Brown
L Franke
L Gautier
L Salwinski
LC Tranchevent
M Liu
P Baldi
P Pagel
R Gupta
RA Irizarry
RI Kondor
RK Nibbe
S Aerts
S Köhler
S Mirkin
S Razick
S Vardhanabhuti
SE Choe
T Fawcett
WK Lim
Y Saad
Yves Moreau
Z Wu
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Discovering novel disease genes is still challenging for diseases for which no prior knowledge, such as known disease genes or disease-related pathways, is available. Genetic studies frequently result in large lists of candidate genes, of which only a few can be followed up for further investigation. We have recently developed a computational method for constitutional genetic disorders that identifies the most promising candidate genes by replacing prior knowledge with experimental data on differential gene expression between affected and healthy individuals. To improve the performance of our prioritization strategy, we have extended our previous work by applying different machine learning approaches that identify promising candidate genes by determining whether a gene is surrounded by highly differentially expressed genes in a functional association or protein-protein interaction network. Results: We have proposed three strategies for scoring disease candidate genes that rely on network-based machine learning approaches, such as kernel ridge regression, heat kernel, and Arnoldi kernel approximation. For comparison purposes, a local measure based on the expression of the direct neighbors is also computed. We have benchmarked these strategies on 40 publicly available knockout experiments in mice, and performance was assessed against results obtained using a standard procedure in genetics that ranks candidate genes based solely on their differential expression levels (Simple Expression Ranking). Our results showed that our four strategies could outperform this standard procedure and that the best results were obtained using the Heat Kernel Diffusion Ranking, leading to an average ranking position of 8 out of 100 genes, an AUC value of 92.3% and an error reduction of 52.8% relative to the standard procedure, which ranked the knockout gene on average at position 17 with an AUC value of 83.7%. Conclusion: In this study we could identify promising candidate genes using network-based machine learning approaches even if no knowledge is available about the disease or phenotype.
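The reported 52.8% error reduction is consistent with treating 1 − AUC as the error and measuring how much it shrinks relative to the baseline. The short check below is only a sketch of that reading; the function name `relative_error_reduction` is hypothetical, and the abstract does not state that this is the exact formula the authors used.

```python
def relative_error_reduction(auc_baseline, auc_new):
    """Fraction by which the error (1 - AUC) shrinks relative to the baseline."""
    err_baseline = 1.0 - auc_baseline
    err_new = 1.0 - auc_new
    return (err_baseline - err_new) / err_baseline


# AUC values taken from the abstract: 83.7% (baseline) and 92.3% (heat kernel).
reduction = relative_error_reduction(auc_baseline=0.837, auc_new=0.923)
print(f"{100 * reduction:.1f}%")  # prints 52.8%
```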

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Machine learning on normalized protein sequences

Author: A Altmann
A Kernytsky
AE Karnoub
AK Patick
B Liu
B Liu
C Strobl
C Torti
D Heider
D Heider
D Heider
D Wang
Daniel Hoffmann
DJ Kempf
Dominik Heider
F Wilcoxon
GC Cawley
GE Forsythe
GM Pao
H Lodhi
I Dubchak
IR Vetter
J Demsar
J Kjaer
J Kyte
J Pánek
Jens Verheyen
JN Dybowski
K Wang
KC Chou
L Breiman
L Nanni
M Borschbach
M Kierczak
M Kozisek
MA Jensen
ME Quinones-Mateu
N Beerenwinkel
N Beerenwinkel
N Beerenwinkel
N Qian
NS Shulman
O Haq
P Chowriappa
P Mundra
R Colonno
S Boisvert
S Ong
S Sonnenburg
S Xu
SY Rhee
T Fawcett
T Hou
T Sing
TB Thompson
V Svetnik
W Resch
Y Guo
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths. Findings: We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%. Conclusions: We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus is a promising alternative to existing methods, especially for protein sequences of variable length.
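A minimal sketch of length normalization by linear interpolation follows. It assumes each sequence has already been encoded as a list of per-residue numeric descriptor values; the abstract does not specify the encoding, so that step, the function name `normalize_length` and the example numbers are all hypothetical.

```python
def normalize_length(values, target_len):
    """Resample a numeric sequence to target_len points by linear interpolation."""
    if not values:
        raise ValueError("cannot resample an empty sequence")
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    if len(values) == 1 or target_len == 1:
        # Degenerate cases: nothing to interpolate between.
        return [float(values[0])] * target_len
    # Spacing of the new sample points, measured in original index units.
    step = (len(values) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        lo = min(int(pos), len(values) - 2)  # left neighbour, clamped at the end
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[lo + 1] * frac)
    return out


# Hypothetical descriptor values for two sequences of different lengths.
short_seq = [0.0, 2.0, 4.0]
long_seq = [0.0, 1.0, 2.0, 3.0, 4.0]
print(normalize_length(short_seq, 5))  # [0.0, 1.0, 2.0, 3.0, 4.0]
print(normalize_length(long_seq, 3))   # [0.0, 2.0, 4.0]
```

After this step every sequence has the same number of features, so a fixed-width classifier such as a random forest can be trained on sequences of different original lengths.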

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central